Abstract



Cross-linguistic similarities are reflected
by the speech sound systems of languages 
all over the world. In this work we try 
to model such similarities observed in the 
consonant inventories, through a complex 
bipartite network. We present a systematic 
study of some of the appealing features of 
these inventories with the help of the bi- 
partite network. An important observation 
is that the occurrence of consonants fol- 
lows a two regime power law distribution. 
We find that the consonant inventory size 
distribution together with the principle of 
preferential attachment are the main rea- 
sons behind the emergence of such a two 
regime behavior. In order to further sup- 
port our explanation we present a synthe- 
sis model for this network based on the 
general theory of preferential attachment. 

1 Introduction 

Sound systems of the world's languages show 
remarkable regularities. Any arbitrary set 
of consonants and vowels does not make up 
the sound system of a particular language. 
Several lines of research suggest that cross- 
linguistic similarities get reflected in the 
consonant and vowel inventories of the languages
all over the world (Greenberg, 1966; Pinker, 1994;
Ladefoged and Maddieson, 1996).
Previously it has been argued that these
similarities are the results of certain general
principles like maximal perceptual contrast
(Lindblom and Maddieson, 1988), feature economy
(Martinet, 1968; Boersma, 1998; Clements, 2004)
and robustness (Jakobson and Halle, 1956;
Chomsky and Halle, 1968). Maximal perceptual
contrast between the phonemes of a language
is desirable for proper perception in a noisy
environment. In fact the organization of the vowel
inventories across languages has been
satisfactorily explained in terms of the single
principle of maximal perceptual contrast
(Jakobson, 1941; Wang, 1968).



There have been several attempts to explain
the observed patterns in consonant inventories
since the 1930s (Trubetzkoy, 1969/1939;
Lindblom and Maddieson, 1988; Boersma, 1998;
Flemming, 2002; Clements, 2004), but unlike the
case of vowels, the structure of consonant
inventories lacks a complete and holistic
explanation (de Boer, 2000). Most of the works are
confined to certain individual principles
(Abry, 2003; Hinskens and Weijer, 2003) rather than
formulating a general theory describing the
structural patterns and/or their stability. Thus,
the structure of the consonant inventories
continues to be a complex jigsaw puzzle, though
the parts and pieces are known.

In this work we attempt to represent the cross- 
linguistic similarities that exist in the consonant 
inventories of the world's languages through a 
bipartite network named PlaNet (the Phoneme 
Language Network). PlaNet has two different sets 
of nodes, one labeled by the languages while the 
other labeled by the consonants. Edges run be- 
tween these two sets depending on whether or not 
a particular consonant occurs in a particular lan- 
guage. This representation is motivated by similar 
modeling of certain complex phenomena observed 
in nature and society, such as, 

• Movie-actor network, where movies and actors
constitute the two partitions and an edge between
them signifies that a particular actor acted in a
particular movie (Ramasco et al., 2004).

• Article-author network, where the edges denote
which person has authored which articles
(Newman, 2001b).

• Metabolic network of organisms, where the
corresponding partitions are chemical compounds
and metabolic reactions. Edges run between
partitions depending on whether a particular
compound is a substrate or result of a reaction
(Jeong et al., 2000).

Modeling of complex systems as networks
has proved to be a comprehensive and emerging
way of capturing the underlying generating
mechanism of such systems (for a review
on complex networks and their generation
see (Albert and Barabasi, 2002; Newman, 2003)).
There have been some attempts as well to
model the intricacies of human languages
through complex networks. Word networks
based on synonymy (Yook et al., 2001b),
co-occurrence (Cancho et al., 2001), and phonemic
edit-distance (Vitevitch, 2005) are examples
of such attempts. The present work also uses
the concept of complex networks to develop a
platform for a holistic analysis as well as synthesis
of the distribution of the consonants across the
languages.

In the current work, with the help of PlaNet we 
provide a systematic study of certain interesting 
features of the consonant inventories. An impor- 
tant property that we observe is the two regime 
power law degree distribution¹ of the nodes labeled
by the consonants. We try to explain this
property in the light of the size of the consonant
inventories coupled with the principle of
preferential attachment (Barabasi and Albert, 1999). Next
we present a simplified mathematical model ex- 
plaining the emergence of the two regimes. In or- 
der to support our analytical explanations, we also 
provide a synthesis model for PlaNet. 

The rest of the paper is organized into five
sections. In section 2 we formally define PlaNet,
outline its construction procedure and present some
studies on its degree distribution. We dedicate
section 3 to state and explain the inferences that
can be drawn from the degree distribution studies
of PlaNet. In section 4 we provide a simplified
theoretical explanation of the analytical results
obtained. In section 5 we present a synthesis model
for PlaNet to hold up the inferences that we draw
in section 3. Finally we conclude in section 6 by
summarizing our contributions, pointing out some
of the implications of the current work and
indicating the possible future directions.

¹Two regime power law distributions have
also been observed in syntactic networks of
words (Cancho et al., 2001), network of mathematics
collaborators (Grossman et al., 1995), and language
diversity over countries (Gomes et al., 1999).

Figure 1: Illustration of the nodes and edges of
PlaNet

2 PlaNet: The Phoneme-Language 
Network 

We define the network of consonants and
languages, PlaNet, as a bipartite graph represented
as $G = \langle V_L, V_C, E \rangle$ where $V_L$ is
the set of nodes labeled by the languages and $V_C$
is the set of nodes labeled by the consonants. $E$
is the set of edges that run between $V_L$ and
$V_C$. There is an edge $e \in E$ between two nodes
$v_l \in V_L$ and $v_c \in V_C$ if and only if the
consonant $c$ occurs in the language $l$.
Figure 1 illustrates the nodes and edges of PlaNet.
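
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `build_planet` and the three toy inventories are assumptions made for the example and are not taken from UPSID.

```python
def build_planet(inventories):
    """Return (V_L, V_C, E) for a dict mapping language -> iterable of consonants."""
    v_l = set(inventories)          # language nodes
    v_c = set()                     # consonant nodes
    edges = set()                   # (language, consonant) pairs
    for language, consonants in inventories.items():
        for consonant in consonants:
            v_c.add(consonant)
            # An edge exists iff the consonant occurs in the language.
            edges.add((language, consonant))
    return v_l, v_c, edges


# Hypothetical toy inventories (not UPSID data).
toy = {
    "L1": ["p", "t", "k", "m"],
    "L2": ["p", "t", "k", "s", "n"],
    "L3": ["t", "k", "m", "n"],
}
v_l, v_c, edges = build_planet(toy)
print(len(v_l), len(v_c), len(edges))
```

Storing the edges as pairs keeps the two partitions explicit, so no edge can ever join two languages or two consonants.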

2.1 Construction of PlaNet 

Many typological studies (Lindblom and Maddieson,
1988; Ladefoged and Maddieson, 1996;
Hinskens and Weijer, 2003) of segmental
inventories have been carried out in the past on
the UCLA Phonological Segment Inventory Database
(UPSID) (Maddieson, 1984). UPSID initially
had 317 languages and was later extended to
include 451 languages covering all the major
language families of the world. In this work
we have used the older version of UPSID
comprising 317 languages and 541 consonants
(henceforth UPSID317), for constructing PlaNet.
Consequently, there are 317 elements (nodes) in
the set $V_L$ and 541 elements (nodes) in the set
$V_C$. The number of elements (edges) in the set $E$
as computed from PlaNet is 7022. At this point
it is important to mention that in order to avoid
any confusion in the construction of PlaNet we
have appropriately filtered out the anomalous and
the ambiguous segments (Maddieson, 1984) from
it. We have completely ignored the anomalous
segments from the data set (since the existence
of such segments is doubtful), and included the
ambiguous ones as separate segments because
there are no descriptive sources explaining how
such ambiguities might be resolved. A similar
approach has also been described in Pericliev and
Valdés-Pérez (2002).

2.2 Degree Distribution of PlaNet 

The degree of a node $u$, denoted by $k_u$, is
defined as the number of edges connected to $u$.
The term degree distribution is used to denote the
way degrees ($k_u$) are distributed over the nodes
($u$). Degree distribution studies are important
for understanding the complex topology of any large
network, which is otherwise very difficult to
visualize. Since PlaNet is bipartite in nature it
has two degree distribution curves, one
corresponding to the nodes in the set $V_L$ and the
other corresponding to the nodes in the set $V_C$.

Degree distribution of the nodes in $V_L$:
Figure 2 shows the degree distribution of the nodes
in $V_L$ where the x-axis denotes the degree of each
node expressed as a fraction of the maximum degree
and the y-axis denotes the number of nodes
having a given degree expressed as a fraction of
the total number of nodes in $V_L$.

Figure 2: Degree distribution of PlaNet for the set
$V_L$. The figure in the inner box is a magnified
version of a portion of the original figure.

It is evident from Figure 2 that the number of
consonants appearing in different languages
follows a β-distribution² (see (Bulmer, 1979) for
reference). The figure shows an asymmetric right
skewed distribution with the values of α and β
equal to 7.06 and 47.64 (obtained using maximum
likelihood estimation method) respectively. The
asymmetry points to the fact that languages
usually tend to have smaller consonant inventory
size, the best value being somewhere between 10 and
30. The distribution peaks roughly at 21 indicating
that the majority of the languages in UPSID317 have
a consonant inventory size of around 21 consonants.

²A random variable is said to have a β-distribution
with parameters α > 0 and β > 0 if and only if its
probability density function is given by

$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$$

for 0 < x < 1 and f(x) = 0 otherwise. Γ(·) is
Euler's gamma function.
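
As a rough numerical illustration of the β-density with the reported parameters, the sketch below evaluates the standard density and its mode on the normalized degree axis. The helper names `beta_pdf` and `beta_mode` are assumptions made for this example; the maximum likelihood fit itself is not reproduced here.

```python
import math


def beta_pdf(x, alpha, beta):
    """Density of the beta distribution on (0, 1); zero elsewhere."""
    if not 0.0 < x < 1.0:
        return 0.0
    # Work in log space so the gamma terms cannot overflow.
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm
                    + (alpha - 1) * math.log(x)
                    + (beta - 1) * math.log(1 - x))


def beta_mode(alpha, beta):
    """Location of the density peak, valid for alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)


alpha, beta = 7.06, 47.64            # parameter values reported in the text
mode = beta_mode(alpha, beta)
mean = alpha / (alpha + beta)
print(round(mode, 3), round(mean, 3))
```

Both values are fractions of the maximum degree, matching the normalized x-axis described for Figure 2.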

Degree distribution of the nodes in $V_C$:
Figure 3 illustrates two different types of degree
distribution plots for the nodes in $V_C$;
Figure 3(a) corresponding to the rank, i.e., the
sorted order of degrees, (x-axis) versus degree
(y-axis) and Figure 3(b) corresponding to the
degree ($k$) (x-axis) versus $P_k$ (y-axis) where
$P_k$ is the fraction of nodes having degree
greater than or equal to $k$.
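
The two views described above can be computed from an edge list as in the following sketch. The function names and the toy edge list are illustrative assumptions, not part of the original study.

```python
from collections import Counter


def consonant_degrees(edges):
    """Degree of each consonant node = number of languages it occurs in."""
    return Counter(consonant for _, consonant in edges)


def rank_degree(degrees):
    """List of (rank, degree) pairs, rank 1 being the highest degree."""
    ordered = sorted(degrees.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def cumulative_distribution(degrees):
    """Map each observed degree k to P_k, the fraction of nodes with degree >= k."""
    values = list(degrees.values())
    total = len(values)
    return {k: sum(1 for d in values if d >= k) / total
            for k in sorted(set(values))}


# Hypothetical toy edge list of (language, consonant) pairs; not UPSID data.
toy_edges = [("L1", "p"), ("L1", "t"), ("L1", "k"),
             ("L2", "t"), ("L2", "k"), ("L3", "k"), ("L3", "s")]
deg = consonant_degrees(toy_edges)
print(rank_degree(deg))
print(cumulative_distribution(deg))
```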

Figure 3 clearly shows that both the curves have
two distinct regimes and the distribution is
scale-free. Regime 1 in Figure 3(a) consists of 21
consonants which have a very high frequency (i.e.,
the degree $k$) of occurrence. Regime 2 of
Figure 3(b) also corresponds to these 21 consonants.
On the other hand Regime 2 of Figure 3(a) as well
as Regime 1 of Figure 3(b) comprises the rest
of the consonants. The point marked as x in both
the figures indicates the breakpoint. Each
regime in both Figure 3(a) and (b) exhibits a power
law of the form

$$y = A x^{-\alpha}$$

In Figure 3(a) $y$ represents the degree of a node
corresponding to its rank $x$ whereas in Figure 3(b)
$y$ corresponds to $P_k$ and $x$ to the degree $k$.
The values of the parameters $A$ and $\alpha$, for
Regime 1 and Regime 2 in both the figures, as
computed by the least square error method, are
shown in Table 1.

Figure 3: Degree distribution of PlaNet for the set
$V_C$ in a log-log scale. Panel (a) plots rank
(x-axis) against degree (y-axis); panel (b) plots
degree $k$ (x-axis) against $P_k$ (y-axis). Regime 1
and Regime 2 are marked in both panels.
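
A least-squares fit of the form $y = A x^{-\alpha}$ can be sketched as a straight-line fit on log-log axes, applied separately to each regime. The code below is a minimal illustration on synthetic data; it assumes the breakpoint is already known (`break_x` is a hypothetical parameter), and it does not reproduce the values of Table 1.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = A * x**(-alpha) on log-log axes; returns (A, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def fit_two_regimes(xs, ys, break_x):
    """Fit each regime separately; points with x <= break_x form the first regime."""
    first = [(x, y) for x, y in zip(xs, ys) if x <= break_x]
    second = [(x, y) for x, y in zip(xs, ys) if x > break_x]
    return fit_power_law(*zip(*first)), fit_power_law(*zip(*second))


# Synthetic two-regime curve with a break at x = 21 (illustrative values only).
xs = list(range(1, 41))
ys = [300 * x ** -0.3 if x <= 21 else 50000 * x ** -2.0 for x in xs]
(first_A, first_alpha), (second_A, second_alpha) = fit_two_regimes(xs, ys, break_x=21)
print(round(first_alpha, 3), round(second_alpha, 3))
```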

It becomes necessary to mention here that such
power law distributions, known variously as Zipf's
law (Zipf, 1949), are also observed in an
extraordinarily diverse range of phenomena including
the frequency of the use of words in human
language (Zipf, 1949), the number of papers
scientists write (Lotka, 1926), the number of hits on
web pages (Adamic and Huberman, 2000) and so
on. Thus our inferences, detailed in the next
section, center mainly around this power law
behavior.

3 Inferences Drawn from the Analysis of 
PlaNet 

In most of the networked systems like the society,
the Internet, the World Wide Web, and many
others, power law degree distribution emerges from
the phenomenon of preferential attachment, i.e.,
when "the rich get richer" (Simon, 1955). With
reference to PlaNet this preferential attachment
can be interpreted as the tendency of a language to
choose a consonant that has been already chosen by
a large number of other languages. We posit that it
is this preferential property of languages that
results in the power law degree distributions
observed in Figure 3(a) and (b).

Nevertheless there is one question that still re- 
mains unanswered. Whereas the power law distri- 
bution is well understood, the reason for the two 
distinct regimes (with a sharp break) still remains 
unexplored. We hypothesize the following.

Hypothesis: The typical distribution of the
consonant inventory size over languages coupled with
the principle of preferential attachment forces
the two distinct regimes to appear in the power
law curves.

As the average consonant inventory size in
UPSID317 is 21, following the principle of
preferential attachment, on an average, the first
21 most frequent consonants are much more
preferred than the rest. Consequently, the nature of
the frequency distribution for the highly frequent
consonants is different from the less frequent ones,
and hence there is a transition from Regime 1 to
Regime 2 in Figure 3(a) and (b).

Support Experiment: In order to establish that
the consonant inventory size plays an important
role in giving rise to the two regimes discussed
above we present a support experiment in which
we try to observe whether the breakpoint x shifts
as we shift the average consonant inventory size.

Experiment: In order to shift the average
consonant inventory size from 21 to 25, 30 and 38
we neglected the contribution of the languages
with consonant inventory size less than n where
n is 15, 20 and 25 respectively and subsequently
recorded the degree distributions obtained each
time. We did not carry out our experiments for
average consonant inventory size more than 38
because such languages are very rare
in UPSID317.
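
The filtering step of this experiment can be sketched as follows. The helper names and the toy inventories are assumptions made for illustration; with real data, the degree distribution would be recomputed from the filtered inventories after each cut.

```python
def filter_by_inventory_size(inventories, min_size):
    """Keep only languages whose consonant inventory has at least min_size members."""
    return {lang: cons for lang, cons in inventories.items()
            if len(cons) >= min_size}


def average_inventory_size(inventories):
    """Mean number of consonants per language."""
    return sum(len(cons) for cons in inventories.values()) / len(inventories)


# Hypothetical toy inventories of sizes 2, 4 and 6 (not UPSID data).
toy = {
    "L1": {"p", "t"},
    "L2": {"p", "t", "k", "m"},
    "L3": {"p", "t", "k", "m", "n", "s"},
}
kept = filter_by_inventory_size(toy, min_size=3)
print(average_inventory_size(toy), average_inventory_size(kept))
```

Dropping the smallest inventories raises the average, which is the mechanism the experiment uses to move the expected breakpoint.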

Table 1: The values of the parameters $A$ and $\alpha$

Figure 4: Degree distributions at different average
consonant inventory sizes

Table 2: The transition points for different average
consonant inventory size

Observations: Figure 4 shows the effect of this
shifting of the average consonant inventory size on
the rank versus degree distribution curves. Table 2
presents the results observed from these curves
with the left column indicating the average
inventory size and the right column the breakpoint x.
The table clearly indicates that the transition
occurs at values corresponding to the average
consonant inventory size in each of the three cases.

Inferences: It is quite evident from our
observations that the breakpoint x has a strong
correlation with the average consonant inventory
size, which therefore plays a key role in the
emergence of the two regime degree distribution
curves.

In the next section we provide a simplistic math- 
ematical model for explaining the two regime 
power law with a breakpoint corresponding to the 
average consonant inventory size. 

4 Theoretical Explanation for the Two 
Regimes 

Let us assume that the inventory of every
language comprises 21 consonants. We further assume
that the consonants are arranged in their hierarchy
of preference. A language traverses the hierarchy
of consonants and at every step decides with
a probability $p$ to choose the current consonant.
It stops as soon as it has chosen all the 21
consonants. Since languages must traverse through
the first 21 consonants regardless of whether the
previous consonants are chosen or not, the
probability of choosing any one of these 21
consonants must be $p$. But the case is different
for the 22nd consonant, which is chosen by a
language if it has previously chosen zero, one, two,
or at most 20, but not all of the first 21
consonants. Therefore, the probability of the 22nd
consonant being chosen is,

$$P(22) = p \sum_{i=0}^{20} \binom{21}{i} p^{i} (1-p)^{21-i}$$

where $\binom{21}{i} p^{i} (1-p)^{21-i}$ denotes the
probability of choosing $i$ consonants from the
first 21. In general the probability of choosing
the $(n+1)$th consonant from the hierarchy is given
by,

$$P(n+1) = p \sum_{i=0}^{20} \binom{n}{i} p^{i} (1-p)^{n-i}$$

Figure 5 shows the plot of the function $P(n)$ for
various values of $p$ which are 0.99, 0.95, 0.9,
0.85, 0.75 and 0.7 respectively in log-log scale.
All the curves, for different values of $p$, have a
nature similar to that of the degree distribution
plot we obtained for PlaNet. This is indicative of
the fact that languages choose consonants from the
hierarchy with a probability function comparable to
$P(n)$.
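
The behavior of $P(n)$ can be checked numerically with the sketch below. It assumes the form $P(n+1) = p \sum_{i=0}^{20} \binom{n}{i} p^{i} (1-p)^{n-i}$, i.e., the consonant is picked with probability $p$ provided fewer than 21 of the preceding consonants were chosen; the function name `choose_probability` and the `size` parameter are illustrative.

```python
from math import comb


def choose_probability(n, p, size=21):
    """Probability that the n-th consonant (1-indexed) of the hierarchy is chosen.

    Assumes the language picks each consonant it reaches with probability p
    and stops once `size` consonants have been chosen.
    """
    prev = n - 1                                   # consonants seen before the n-th
    # Probability that fewer than `size` of the previous consonants were chosen.
    not_full = sum(comb(prev, i) * p ** i * (1 - p) ** (prev - i)
                   for i in range(min(prev, size - 1) + 1))
    return p * not_full


for n in (1, 21, 22, 25, 30):
    print(n, choose_probability(n, p=0.9))
```

Under this assumption the value stays at $p$ for the first 21 positions and drops from the 22nd position onwards, which is the source of the breakpoint.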

Owing to the simplified assumption that all
the languages have only 21 consonants, the first
regime is a horizontal straight line; however we
believe a more rigorous mathematical model can be
built taking into consideration the β-distribution
rather than just the mean value of the inventory
size, which can explain the negative slope of the
first regime. We look forward to doing so as a part
of our future work. Here, instead, we try to
investigate the effect of the exact distribution of
the language inventory size on the nature of the
degree distribution of the consonants through a
synthetic approach based on the principle of
preferential attachment, which is described in the
subsequent section.

5 The Synthesis Model based on 
Preferential Attachment 



Figure 5: Plot of the function $P(n)$ in log-log
scale (x-axis: steps $n$)

Barabasi and Albert (1999) observed that a
common property of many large networks is that the
vertex connectivities follow a scale-free power
law distribution. They remarked that two generic
mechanisms can be considered to be the cause
of this observation: (i) networks expand
continuously by the addition of new vertices, and
(ii) new vertices attach preferentially to sites
(vertices) that are already well connected. They
found that a model based on these two ingredients
reproduces the observed stationary scale-free
distributions, which in turn indicates that the
development of large networks is governed by robust
self-organizing phenomena that go beyond the
particulars of the individual systems.

Inspired by their work and the empirical as well
as the mathematical analysis presented above, we
propose a preferential attachment model for
synthesizing PlaNet (PlaNet$_{syn}$ henceforth) in
which the degree distribution of the nodes in $V_L$
is known. Hence $V_L = \{L_1, L_2, \ldots, L_{317}\}$
have degrees (consonant inventory size)
$\{k_1, k_2, \ldots, k_{317}\}$ respectively. We
assume that the nodes in the set $V_C$ are
unlabeled. At each time step, a node $L_j$
($j$ = 1 to 317) from $V_L$ tries to attach itself
with a new node $i \in V_C$ to which it is not
already connected. The probability $Pr(i)$ with
which the node $L_j$ gets attached to $i$ depends on
the current degree of $i$ and is given by

$$Pr(i) = \frac{k_i + \epsilon}{\sum_{i' \in V_j} (k_{i'} + \epsilon)}$$

where $k_i$ is the current degree of the node $i$,
$V_j$ is the set of nodes in $V_C$ to which $L_j$ is
not already connected and $\epsilon$ is the
smoothing parameter which is used to reduce bias
and favor at least a few attachments with nodes in
$V_j$ that do not have a high $Pr(i)$. The above
process is repeated until all $L_j \in V_L$ get
connected to exactly $k_j$ nodes in $V_C$. The
entire idea is summarized in Algorithm 1.
Figure 6 shows a partial step of the synthesis
process illustrated in Algorithm 1.

    repeat
        for j = 1 to 317 do
            if the node L_j ∈ V_L has at least one more
            consonant to be chosen from V_C then
                Compute V_j = V_C − V(L_j), where V(L_j) is
                the set of nodes in V_C to which L_j is
                already connected;
                for each node i ∈ V_j do
                    Pr(i) = (k_i + ε) / Σ_{i' ∈ V_j} (k_{i'} + ε),
                    where k_i is the current degree of the
                    node i and ε is the model parameter.
                    Pr(i) is the probability of connecting
                    L_j to i.
                end
                Connect L_j to a node i ∈ V_j following the
                distribution Pr(i);
            end
        end
    until all languages complete their inventory quota;

Algorithm 1: Algorithm for synthesis of PlaNet based
on preferential attachment
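
A runnable sketch of the synthesis procedure is given below. It assumes that the attachment probability has the additive-smoothing form $Pr(i) \propto k_i + \epsilon$, normalized over the consonants not yet connected to the language; the function name, its parameters and the toy sizes are illustrative and are not the authors' implementation.

```python
import random


def synthesize_planet(inventory_sizes, num_consonants, epsilon, seed=None):
    """Grow a synthetic language-consonant network by preferential attachment.

    inventory_sizes: target degree k_j for each language node L_j.
    Returns a list of sets; entry j holds the consonant indices chosen by L_j.
    """
    if max(inventory_sizes) > num_consonants:
        raise ValueError("an inventory cannot exceed the number of consonants")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive so that all weights are non-zero")
    rng = random.Random(seed)
    degree = [0] * num_consonants            # current degree k_i of each consonant
    chosen = [set() for _ in inventory_sizes]
    pending = True
    while pending:                            # repeat ... until all quotas are met
        pending = False
        for j, quota in enumerate(inventory_sizes):
            if len(chosen[j]) >= quota:
                continue                      # L_j has completed its inventory
            # V_j: consonants to which L_j is not yet connected.
            candidates = [i for i in range(num_consonants) if i not in chosen[j]]
            # Assumed attachment weight k_i + epsilon; choices() normalizes it.
            weights = [degree[i] + epsilon for i in candidates]
            pick = rng.choices(candidates, weights=weights, k=1)[0]
            chosen[j].add(pick)
            degree[pick] += 1
            pending = True
    return chosen


# Toy run with three languages and ten consonants (illustrative sizes only).
sizes = [3, 5, 2]
result = synthesize_planet(sizes, num_consonants=10, epsilon=0.0701, seed=0)
print([sorted(s) for s in result])
```

Each pass over the languages adds at most one consonant per language, mirroring the round-robin structure of the repeat loop in the pseudocode.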

Simulation Results: Simulations reveal that for
PlaNet$_{syn}$ the degree distribution of the nodes
belonging to $V_C$ fits well with the analytical
results we obtained earlier in section 2. Good fits
emerge for the range $0.06 \le \epsilon \le 0.08$
with the best being at $\epsilon = 0.0701$. Figure 7
shows the degree $k$ versus $P_k$ plots for
$\epsilon = 0.0701$ averaged over 100 simulation
runs.

Figure 6: A partial step of the synthesis process
(PlaNet$_{syn}$ at step 7 and at step 8, with the
node of highest degree marked). When the language
$L_4$ has to connect itself with one of the nodes in
the set $V_C$ it does so with the one having the
highest degree (=3) rather than with others in order
to achieve preferential attachment which is the
working principle of our algorithm

Figure 7: Degree distribution of the nodes in
$V_C$ for PlaNet$_{syn}$, PlaNet, and when the
model incorporates no preferential attachment; for
PlaNet$_{syn}$, $\epsilon = 0.0701$ and the results
are averaged over 100 simulation runs

The mean error³ between the degree distribution
plots of PlaNet and PlaNet$_{syn}$ is 0.03 which
intuitively signifies that on an average the
variation in the two curves is 3%. On the contrary,
if there were no preferential attachment
incorporated in the model (i.e., all connections
were equiprobable) then the mean error would have
been 0.35 (35% variation on an average).

³Mean error is defined as the average difference
between the ordinate pairs where the abscissas are
equal.
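
The mean error between two degree distribution curves can be sketched as below. The sketch assumes that "difference" means absolute difference and that each curve is stored as a mapping from degree $k$ to $P_k$; both are assumptions of this example, as are the toy values.

```python
def mean_error(curve_a, curve_b):
    """Average absolute difference between ordinates at shared abscissas.

    Each curve is a dict mapping degree k to P_k.
    """
    shared = sorted(set(curve_a) & set(curve_b))
    if not shared:
        raise ValueError("the curves share no abscissa")
    return sum(abs(curve_a[k] - curve_b[k]) for k in shared) / len(shared)


# Hypothetical toy curves (not the PlaNet data).
observed = {1: 1.0, 2: 0.5, 3: 0.25}
synthetic = {1: 1.0, 2: 0.4, 4: 0.1}
print(mean_error(observed, synthetic))
```

Only degrees present in both curves contribute, which follows the footnote's restriction to equal abscissas.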

6 Conclusions, Discussion and Future 
Work 

In this paper, we have analyzed and synthesized 
the consonant inventories of the world's languages 
in terms of a complex network. We dedicated the 
preceding sections essentially to, 

• Represent the consonant inventories through 
a bipartite network called PlaNet, 

• Provide a systematic study of certain impor- 
tant properties of the consonant inventories 
with the help of PlaNet, 



• Propose analytical explanations for the two 
regime power law curves (obtained from 
PlaNet) on the basis of the distribution of the 
consonant inventory size over languages to- 
gether with the principle of preferential at- 
tachment, 

• Provide a simplified mathematical model to 
support our analytical explanations, and 

• Develop a synthesis model for PlaNet based 
on preferential attachment where the conso- 
nant inventory size distribution is known a 
priori. 

We believe that the general explanation pro- 
vided here for the two regime power law is a fun- 
damental result, and can have a far reaching im- 
pact, because two regime behavior is observed in 
many other networked systems. 

Until now we have been mainly dealing with the
computational aspects of the distribution of
consonants over the languages rather than exploring
the real world dynamics that gives rise to such a
distribution. An issue that draws immediate
attention is how preferential attachment, which is
a general phenomenon associated with network
evolution, can play a prime role in shaping the
consonant inventories of the world's languages. The
answer perhaps is hidden in the fact that language
is an evolving system and its present structure is
determined by its past evolutionary history. Indeed
an explanation based on this evolutionary model,
with an initial disparity in the distribution of
consonants over languages, can be intuitively
verified as follows: let there be a language
community of $N$ speakers communicating among
themselves by means of only two consonants, say /k/
and /g/. If we assume that every speaker has $f$
descendants and language inventories are
transmitted with high fidelity, then after $i$
generations it is expected that the community will
consist of $m f^{i}$ /k/ speakers and $n f^{i}$ /g/
speakers, where $m$ and $n$ are the initial numbers
of /k/ and /g/ speakers. Now if $m > n$ and $f > 1$,
then for sufficiently large $i$,
$m f^{i} \gg n f^{i}$. Stated differently,
the /k/ speakers by far outnumber the /g/ speakers
even if initially the number of /k/ speakers is
only slightly higher than that of the /g/ speakers.
This phenomenon is similar to that of preferential
attachment where language communities get
attached to, i.e., select, consonants that are
already highly preferred. Nevertheless, it remains
to be seen from where such an initial disparity in
the distribution of the consonants over languages
might have originated.

In this paper, we mainly dealt with the occur- 
rence principles of the consonants in the invento- 
ries of the world's languages. The work can be fur- 
ther extended to identify the co-occurrence likeli- 
hood of the consonants in the language inventories 
and subsequently identify the groups or commu- 
nities within them. Information about such com- 
munities can then help in providing an improved 
insight about the organizing principles of the con- 
sonant inventories. 